Using HDDT to avoid instance propagation in unbalanced and evolving data streams

Authors

  • Andrea Dal Pozzolo
  • Reid Johnson
  • Olivier Caelen
  • Serge Waterschoot
  • Nitesh V Chawla
  • Gianluca Bontempi
Abstract

Hellinger distance has been successfully used as a tree splitting criterion in Hellinger Distance Decision Trees (HDDT) [10] for unbalanced static datasets. In unbalanced data streams, state-of-the-art techniques use instance propagation and standard decision trees to cope with class imbalance. However, it is not always possible to revisit or store old instances of a stream. We solve this problem using HDDT in data streams. In this paper we show how HDDT can be successfully applied to unbalanced and evolving data streams, leading to improved predictive accuracy and speed. We then use a Hellinger-weighted ensemble of HDDTs to combat concept drift and increase accuracy. We test our framework on several streaming datasets with unbalanced classes and concept drift.

Keywords: HDDT, Hellinger distance, Unbalanced data, Streaming data, Concept drift, Fraud detection.
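
As a rough illustration of the splitting criterion the abstract refers to, the sketch below computes the Hellinger distance between the two class-conditional distributions over the branches of a candidate split. It assumes a binary class problem and per-branch class counts as input; the function name and input format are hypothetical and not taken from the paper.

```python
import math


def hellinger_split_value(pos_counts, neg_counts):
    """Hellinger distance between class-conditional branch distributions.

    pos_counts[j] and neg_counts[j] are the numbers of positive and
    negative instances sent to branch j by a candidate split
    (an assumed input format, chosen for illustration).
    """
    n_pos = sum(pos_counts)
    n_neg = sum(neg_counts)
    if n_pos == 0 or n_neg == 0:
        # Without both classes there is nothing to separate.
        return 0.0
    total = 0.0
    for p, n in zip(pos_counts, neg_counts):
        # Compare the fraction of each class that falls into this branch.
        total += (math.sqrt(p / n_pos) - math.sqrt(n / n_neg)) ** 2
    return math.sqrt(total)


# A split that separates the classes perfectly scores sqrt(2);
# a split that distributes both classes identically scores 0.
perfect = hellinger_split_value([10, 0], [0, 1000])
useless = hellinger_split_value([5, 5], [500, 500])
```

Because the value depends only on within-class fractions, multiplying the majority-class counts by a constant leaves it unchanged, which is one way to see why the criterion is described as insensitive to class skew.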


Similar resources

Classification of encrypted traffic for applications based on statistical features

Traffic classification plays an important role in many aspects of network management, such as identifying the type of transferred data, detecting malware applications, and applying policies to restrict network access. Basic methods in this field used obvious traffic features such as port number and protocol type to classify traffic. However, recent changes in applicat...


Complexity curve: a graphical measure of data complexity and classifier performance

HDDT  breast-y      286    9   2  2.36
HDDT  compustat     13657  20  2  25.26
HDDT  covtype       38500  10  2  13.02
HDDT  credit-g      1000   20  2  2.33
HDDT  estate        5322   12  2  7.37
HDDT  german-numer  1000   24  2  2.33
HDDT  heart-v       200    13  2  2.92
HDDT  hypo          3163   25  2  19.95
HDDT  ism           11180  6   2  42.00
HDDT  letter        20000  16  2  24.35
HDDT  oil           937    49  2  21.85
HDDT  page          5473   10  2  8.77
HDDT  pendigits     10992  16  2  8.63
HDDT  phoneme       5404   5   2  2.41
HDDT  ...


The Health Policy Process in Vietnam: Going Beyond Kingdon’s Multiple Streams Theory; Comment on “Shaping the Health Policy Agenda: The Case of Safe Motherhood Policy in Vietnam”

This commentary reflects upon the article along three broad lines. It reflects on the theoretical choices and omissions, particularly highlighting why it is important to adapt the multiple streams framework (MSF) when applying it in a socio-political context like Vietnam’s. The commentary also reflects upon the analytical threads tackled by Ha et al; for instance, it highlights the opportunitie...


IBLStreams: a system for instance-based classification and regression on data streams

This paper presents an approach to learning on data streams called IBLStreams. More specifically, we introduce the main methodological concepts underlying this approach and discuss its implementation under the MOA software framework. IBLStreams is an instance-based algorithm that can be applied to classification and regression problems. In comparison to model-based methods for learning on data ...


Increasing Skew Insensitivity of Decision Trees with Hellinger Distance

Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms typically perform poorly. The heuristics used in learning tend to favor the larger, less important classes in such problems. While other methods, like sampling, have been introduced to combat imbalance, these tend to be computationally expensive. This paper proposes Hellinger distance as a m...



Publication date: 2014